A high-quality chromosome-scale genome assembly of blood orange, an important pigmented sweet orange variety

Blood orange (BO) is a rare red-fleshed sweet orange (SWO) with a high anthocyanin content and is associated with numerous health-related benefits. Here, we reported a high-quality chromosome-scale genome assembly for Neixiu (NX) BO, reaching 336.63 Mb in length with contig and scaffold N50 values of 30.6 Mb. Furthermore, 96% of the assembled sequences were successfully anchored to 9 pseudo-chromosomes. The genome assembly also revealed the presence of 37.87% transposon elements and 7.64% tandem repeats, and the annotation of 30,395 protein-coding genes. A high level of genome synteny was observed between BO and SWO, further supporting their genetic similarity. The speciation event that gave rise to the Citrus species predated the duplication event found within them. The genome-wide variation between NX and SWO was also compared. This first high-quality BO genome will serve as a fundamental basis for future studies on functional genomics and genome evolution.


Background & Summary
Sweet orange (SWO, Citrus sinensis L. Osbeck) is the most important citrus species 1 .SWO varieties are typically categorized into three subgroups based on their agronomical characteristics: common orange, navel orange, and blood orange (BO) 2 .BO stands out for brilliant red coloration of both flesh and rinds 3 , which is not usually found in Citrus L. 4,5 .
Anthocyanins, which belong to a large family of flavonoids, are accountable for the characteristic red color of BO 3 .In addition to contributing to pigmentation 3 , anthocyanins have various health-promoting benefits in humans, such as their antioxidant capacity and potential for cancer prevention 6 .As consumers become increasingly health-conscious, the popularity of BO has been growing worldwide 7 because of its exceptional nutraceutical attributes, including vitamins, sugars, dietary fiber, minerals, and flavonoids, particularly anthocyanins 8 .
Moro, Tarocco, and Sanguinello are the three most important commercial BO types 9 .Moro has the deepest red color among the three varieties, followed by Sanguinello and Tarocco 4,9 .Tarocco is a medium-sized seedless variety famous for its peelability and sweetest taste 2 .In our long-term BO breeding program, we have discovered an unexpected and natural bud mutation of Tarocco, which we have named '内秀' (Neixiu, NX).In Chinese wisdom, '内秀' is used to describe a person who looks pretty ordinary, but he is intelligent in an understated way.Based on more than 5 years of careful observation, we found that NX surpasses common Tarocco in terms of both sweetness and redness in the Southwest region of China (Fig. 1).Consequently, we consider NX to be a highly promising BO cultivar.
Recent advancements in sequencing technology and associated bioinformatic tools have significantly expedited citrus genomic studies.To date, three genomes of the SWO variety have been released.In 2013, the first draft of a di-haploid SWO genome was complied using short Illumina reads 10 .Subsequently, Wang et al. 11 successfully generated a de novo reference genome of the di-haploid SWO using the Nanopore ultra-long and PacBio long-read sequencing platforms.More recently, Wu et al. 12 accomplished the assembly of a diploid SWO genome at the chromosome level, specifically for the 'Valencia' variety.However, it is worth noting that genomic data for this important BO in the citrus industry is currently unavailable.In the investigation of BO functional genomics and genetics, the initial task involves the interpretation of genomic data.
Therefore, in the present study, we constructed a high-quality chromosome-scale genome assembly of BO by combining Illumina sequencing, third-generation circular consensus sequencing (CCS), and high-throughput chromosome conformation capture (Hi-C) sequencing.This integrated methodology resulted in a genome size of approximately 336.63 Mb, with a contig N50 value of 30.6 Mb.A total of 96% of the assembled sequences were successfully anchored to nine pseudo-chromosomes (Table 1).To investigate the evolutionary patterns of genes and genomes, comparative genomic analyses were performed on the BO genome and 11 other genomes representing various plant species.The study presents the first high-quality chromosome-scale genome of BO.The dataset generated from this research will significantly contribute to the advancement of our knowledge in BO functional genomics and the trajectory of citrus genomes.

Methods
Plant materials.For genome sequencing, young leaf samples were randomly collected from five-year-old NX trees.Samples were immediately frozen in liquid nitrogen, followed by preservation at −80 °C until DNA and RNA extraction.For RNA extraction, fresh plant tissues including leaves, fruits, buds, roots, and branches were obtained from the same tree.The 'Valencia' SWO 11 was used in the bioinformatics analysis.
Library construction and sequencing.Genomic DNA and total RNA were extracted using DNeasy Plant Mini Kit and RNeasy Plus Mini Kit (Qiagen, Valencia, CA, USA), respectively, according to the manufacturer's instructions.After extraction, short-read (350-bp) libraries were constructed using a library construction kit (Illumina, San Diego, CA, USA) and then sequenced on a Novaseq 6000 platform (Illumina), which finally generated a total of 24.21 Gb of raw data, covering 74.66 × of the genome.The resulting clean reads were used for genome surveys, including the evaluation of genome size, GC content, and heterozygosity.
PacBio sequencing libraries were constructed by Biomarker Technologies Corporation (Beijing, China) using the SMRTbell ® express template prep kit 2.0 (PacBio, Menlo Park, CA, USA).Before library preparation, genomic DNA was sheared into 15 kb fragments using Megaruptor ® 3 (Diagenode, Denville, NJ, USA).A total of 21.21 Gb high-fidelity (HiFi) clean data with an N50 value of 19.36 kb and an average read length of 18.88 kb were produced using the CCS mode on a PacBio Sequel II platform with the Sequel sequencing kit 2.0 (PacBio).These data are equivalent to 65 × coverage of the entire genome.were used identify the full-length repeat retrotransposons (LTR-RTs).High-quality intact full-length LTR-RTs and non-redundant LTR libraries were produced from the outputs of LTR_retriever v.2.9.0 25 .By combining the de novo TE library with known TEs in RepBase v.19.06 26 , REXdb v.3.0 27 , and Dfam v.3.5 22 , a non-redundant species-specific TE library was obtained.The final TEs were identified and classified through a homology search against the library using RepeatMasker v.4.1.4 28.Tandem repeats were annotated using Tandem Repeats Finder 29 and MISA v.2.1 30 .In the BO genome, we identified 127.82 Mb (37.97%) of TEs and 25.72 Mb (7.64%) of tandem repeats.The majority of repeats (28.06%) were Class I retrotransposons, dominated by gypsy (13.04%) and copia (7.52%) elements.Class II DNA transposons accounted for 9.91% of the BO genome (Table 2).

Protein-coding genes prediction.
A total of 30,395 protein-coding genes have been annotated by incorporating de novo, homology, and transcript-based predictions (Table 3).The de novo gene models were predicted using Augustus v.3.2.2 31 and SNAP v.2006-07-28 32 .GeMoMa v.1.7 33 was used for homology-based predictions by annotating the gene models in BO with amino acid sequences from C. grandis, SWO, Poncirus trifoliata, and Arabidopsis thaliana genomes.For transcript-based prediction, RNA-seq data was mapped to the reference genome using HISAT v.2.2.1 34 and quantified with StringTie v.2.1.4 35.Genes were predicted from the assembled transcripts using GeneMarkS-T v.5.1 36 .Another transcript-based prediction method was performed using Trinity v.2.1.1 37 .Program to Assemble Spliced Alignments (PASA) v.2.4.1 38 was used to predict gene models based on the unigenes.The genes predicted in the aforementioned three annotation files were merged using EVidenceModeler v.1.1.1 39 , and the final gene set was updated using PASA v.2.4.1 38 .Each gene exhibited an average of 5.02 exons, with a mean gene length of 3489.94 bp and a coding sequence size of 1152.21 bp.The average lengths of exons and intros were 1440.51 and 2049.43 bp, respectively (Table 3).

Phylogenetic and evolutional analyses.
A phylogenetic tree was constructed using IQ-Tree 48 based on 594 single-copy gene sequences obtained from these 12 species.The alignment of orthologous gene sequence was performed independently using MAFFT v.7.490 49 , followed by the conversion of protein alignments to nucleotide sequence alignments using PAL2NAL v.14 50 .The alignments were then refined using the Gblocks 0.91b 51 .Clean super-alignments were used to construct a maximum likelihood phylogenetic tree using IQ-Tree 48 with a fitted model of GTR + F + I + G4 suggested by ModelFinder 52 .The resulting tree revealed BO is a sister clade to C. sinensis, indicating a closer relationship with SWO than with mandarins (C.unshiu and C. clementina) (Fig. 5a).The divergence time among the 12 plant species was calculated using MCMCTree in the PAML v.4.9 53 with 95% confidence intervals.TimeTree 54 C. sinensis-Arabidopsis thaliana, 90.0-99.9mya.These estimates were subsequently used to correct the fossil time obtained from the software algorithm.Amborella trichopoda was used as the outgroup for conducting maximum-likelihood-based phylogenetic analyses.The divergence time between the SWO and BO (2.24-4.83mya) was comparatively more recent compared than that of C. unshiu and C. clementina (2.33-4.96mya), while the divergence time of oranges and mandarins (2.98-5.94mya) was found to be the earliest among the four Citrus species (Fig. 5a).The gene expansion and contraction of the gene families were determined using Computation Analysis of gene Family Evolution (CAFE) 55 v.3.1.In total, 920 and 1,313 gene families expanded and contracted in the BO genome, respectively (Fig. 5b).

Synteny and whole-genome duplication (WGD) analysis.
To better understand the evolutionary history of BO, we performed a genomic collinearity analysis of BO, SWO, C. clementina, V. vinifera, M. domestica, and Z. jujube.Homologous gene pairs were identified through a comparison of the genomic sequences of two species using the DIAMOND v.0.9.29.130 56 .Subsequently, JCVI v.0.9.13 was used to visualize collinear blocks identified using homologous gene pairs in MCScanX 57 .A significant level of synteny was observed between the genomes of BO and SWO.The BO chromosomes were mapped with more fragments in the SWO than in C. clementina (Fig. 5c).Table 2. Repetitive elements and their proportions in Neixiu blood orange.
To determine the occurrence of WGD events, a combination of the synonymous mutation rate (Ks) and fourfold synonymous third-codon transversion (4DTv) was employed.This analysis was conducted using WGD v.1.1.1 58 and a publicly available script (https://github.com/JinfengChen/Scripts). The 4DTv values of BO, SWO, and C. clementina reached a peak of 0.5, indicating the occurrence of WGD events in Citrus.The Citrus speciation event took place prior to the duplication event observed in Citrus species, evidenced by the pairwise 4DTv distribution of BO compared to M. domestica, V. vinifera, Z. jujuba, and Arabidopsis thaliana (Fig. 5d).

Genome-wide variation analysis.
To investigate the genomic differences between BO and SWO, we used the assembled NX as the reference genome and the most recent chromosome-level phased diploid Valencia SWO genome, as published by Wu et al. 12 , for conducting genome-wide alignments with the nucmer, delter-filter, and show-coord programs from MUMmer v.4.0 59 .This analysis yielded a total of 1,275,362 single-nucleotide polymorphism (SNP) differences and 295,024 insertion-deletions (InDels), including 170,365 insertions and 124,659 deletions.Subsequently, the filtered delta files were subjected to SyRI 60 for the identification of structural variations (SVs) in.A total of 694 copy number variations (CNVs) were found in SWO genome compared to the BO genome, with 362 copies increased and 332 copies lost in number in the SWO genome.Presence-absence variations (PAVs) are major contributors to genome structural variations, impacting both phenotypic and genomic variability 61 .We detected 1,081 present and 1,340 absent variations.GO and KEGG enrichment analyses were conducted using clusterProfiler v.3.14.10 47 for genes where mutations were detected.ANNOVAR 62 was used for the functional annotation of genetic variants.

Data records
The genome sequences, PacBio raw data, and Hic-C raw data have been deposited to the NCBI SRA database 63,64 and the genome gff annotation file was uploaded to 65 .Genome estimation, statistics of assembled genome sequences, integrated function annotation, statistics of gene family clustering, and list of the expanded and constracted gene families were submitted at the Figshare 66 .

technical Validation
The assessment of the final assembled genome completeness and quality involved the implementation of (1) Core Eukaryotic Genes Mapping Approach (CEGMA) v. 2.5 67 , (2) Benchmarking Universal Single-Copy Orthologs (BUSCO) v. 5.2.1 68 , (3) alignment using Burrows-Wheeler Aligner (BWA) 69 with Illumina data, and (4) alignment using Minimap 2 70 with HiFi reads.The evaluation of the final assembled genome's integrity was performed by referencing the CEGMA database, which contains 458 core eukaryotic genes (CEGs) and 248 highly conserved CEGs, and by employing tblastn, genewise, and geneid software 67 .The assembled genome contained 98.25% (450) of CEGs and 95.16% (236) of highly conserved CEGs, suggesting that it contained most CEGs.To evaluate the integrity of the assembly, BUSCO 68 analysis was conducted using the Embryophyta database OrthoDB v. 10 (http://cegg.unige.ch/orthodb), which encompasses 1,614 orthologous single-copy genes.The assembled genome contained 1,585 (98.20%) of these genes.Mapping of Illumina short reads and HiFi long reads to the assembled genome revealed that approximately 97.66% and 99.58% of the reads, respectively, aligned successfully (Table 1).
To ensure the reliability of the MCMCTree analyses, the correlated molecular clock and JC69 model were employed, and all relevant computations were performed twice to ensure consistency.The correlation between two iterations in this test is 1.
In order to evaluate the reliability of the inference in constructing the phylogenetic tree, 1000 bootstrap replicates were performed for each branch.

Fig. 2
Fig.2Frequency distribution of the 19-mer analysis.The x-axis represented the K-mer depth and y-axis represented the frequency of K-mer correspond to the depth.

Fig. 3
Fig.3Hi-C interaction heatmap for Neixiu blood orange.The map shows scaffolded and independently assembled chromsomes at high resolution.

Fig. 4
Fig. 4 Comparative genomic analysis of Neixiu blood orange and other 11 representative plant species.(a) Gene family cluster petal map of Neixiu blood orange and other 11 representative plant species.The central circle represents common gene families, and the outer petals represent specific gene families.(b) Venn diagram showing gene family clusters of five Rutaceae species.(c) The number of gene copies and their distribution among 12 plant species.(d) KEGG enrichment analysis of genes specific to Neixiu blood orange.

Table 1 .
Assembly and assessment of Neixiu blood orange genome.

Table 3 .
Genome annotation of Neixiu blood orange.